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Abstract 

Background: Recent advances in rapid, low-cost sequencing have opened up the opportunity to study complete 
genome sequences. The computational approach of multiple genome alignment allows investigation of 
evolutionarily related genomes in an integrated fashion, providing a basis for downstream analyses such as 
rearrangement studies and phylogenetic inference. 

Graphs have proven to be a powerful tool for coping with the complexity of genome-scale sequence alignments. The 
potential of graphs to intuitively represent all aspects of genome alignments led to the development of graph-based 
approaches for genome alignment. These approaches construct a graph from a set of local alignments, and derive a 
genome alignment through identification and removal of graph substructures that indicate errors in the alignment. 

Results: We compare the structures of commonly used graphs in terms of their abilities to represent alignment 
information. We describe how the graphs can be transformed into each other, and identify and classify graph 
substructures common to one or more graphs. Based on previous approaches, we compile a list of modifications that 
remove these substructures. 

Conclusion: We show that crucial pieces of alignment information, associated with inversions and duplications, are 
not visible in the structure of all graphs. If we neglect vertex or edge labels, the graphs differ in their information 
content. Still, many ideas are shared among all graph-based approaches. Based on these findings, we outline a 
conceptual framework for graph-based genome alignment that can assist in the development of future genome 
alignment tools. 



Background 

Sequence comparison through multiple alignment is an 
indispensable tool for understanding genomes and their 
shared histories [1]. Even though the foundation for 
genomic sequence alignment was already laid in the 1980s 
[2], the interest is still ongoing [1,3>4], one reason being 
that it has critical relevance [5] for many bioinformatics 
analyses. The aim of sequence alignment is to uncover 
homologies by assigning sequence positions to each other, 
which implies that these positions derived from a com- 
mon ancestor. 

Evolutionary events that change genomic sequences are 
often classified into small changes and large structural 
changes [6] . Small changes affect only one or few sequence 
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positions and include substitutions, insertions, and dele- 
tions. They do not influence the order of sequence posi- 
tions, and thus can be captured by colinear alignment. 
Structural changes involve longer genomic segments, 
thereby affecting the structure and order of genomic 
sequences. They include non- colinear changes like inver- 
sions, translocations and duplications in addition to inser- 
tions and deletions of longer segments. 

While colinear multiple sequence alignment has been 
studied extensively for a long time [7-16], the problem 
of non-colinear alignment has been brought into focus 
only within the last decade [17-22], after more and more 
whole genomes started to become available. Non-colinear 
alignments, as opposed to colinear alignments, model all 
kinds of evolutionary changes and thereby enable cor- 
rect homology prediction for whole genomes with non- 
colinear changes. This is comparable to the way global 
alignments integrate more information than local align- 
ments by assigning all parts of sequences to each other, 
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and the way multiple alignments take information from 
more than two sequences into account for homology pre- 
diction. Over and above, non-colinear multiple global 
alignments of whole genomes, genome alignments for 
short, integrate as much sequence similarity information 
as is available. 

Together with the prediction of homology, genome 
alignments provide a segmentation of the genomes orig- 
inating from large structural changes. Depending on the 
similarity of genomes, segments can be shorter or span 
several genes and reveal local colinearity. Rearrangement 
studies [23] explore the order of such segments and infer 
genomic distances based on the number of breakpoints 
[24,25] or predict scenarios of evolutionary changes 
[26-28]. These studies often employ graphs, e.g., break- 
point graphs [29-31], that resemble graph data structures 
used for genome alignment. Despite this similarity in the 
approach, genome alignments pursue a slightly different 
goal than rearrangement studies. The goal is homol- 
ogy prediction instead of reconstruction of evolutionary 
histories. Genome alignments, which are the focus of 
this article, integrate more information than rearrange- 
ment studies by combining segmentation and sequence 
similarity. 

Considering the large search space, genome alignment 
is an ambitious task and is usually accomplished using 
heuristic approaches. The first step in genome alignment 
is commonly the computation of a set of local align- 
ments. It is essential for most methods that the set of local 
alignments covers all main genomic similarities, whereas 
additional spurious similarities have a smaller impact. In 
colinear alignment, such a set usually constitutes a super- 
position of several alignment possibilities with some local 
alignments in conflict regarding the colinearity constraint 
(see Figure 1). The task is then to select the best conflict- 
free subset according to a given optimization function. In 
genome alignment, as opposed to colinear alignment, any 
set of local alignments can be viewed as a valid solution, 
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Figure 1 Alternative alignments of the sequences CATCGA and 
CCGATA. The alignment on the left is colinear if the dinucleotides AT 
(red) are interpreted as insertion or deletion. Alternatively, the AT 
dinucleotides can be aligned and the CG dinucleotides interpreted as 
insertion or deletion. Non-colinear aligners that allow for 
translocations may align the AT dinucleotides in addition to the CG 
dinucleotides. The alignment on the right shows a non-colinear 
alternative that interprets the four nucleotides ATCG as inversion 
(reverse complement). In this example, we expect non-colinear 
aligners to prefer the inversion (right) over the translocation (left) 
since it creates fewer segments. 



one that induces a segmentation. However, the induced 
segmentation can be improved by selecting a subset of 
local alignments. The subset should contain those local 
alignments that are most likely to represent homologies 
when viewed in the context of the whole set of local 
alignments. The final step is then to find the best segmen- 
tation according to the set of local alignments and possibly 
a subsequent realignment of segments with a colinear 
alignment method. 

For the step of selecting subsets of local alignments 
and for inducing a segmentation, graphs serve as a con- 
venient tool. The idea is that graphs show substructures 
indicating errors in the alignment, e.g., specific cycles. 
Once identified in a graph, we can eliminate these sub- 
structures, e. g., by removing local alignments, which is 
a modification of the genome alignment. Thus, graphs 
can assist in improving genome alignments. In addition, 
graphs provide an intuitive representation of similarities 
and changes between genomes, and so visualize alignment 
structures. In comparison to tabular alignments, genome 
alignment graphs are more versatile insofar that it is pos- 
sible to model colinear and non-colinear changes without 
the need of choosing a reference genome. 

Several graphs have been proposed, each in the con- 
text of a specific application such as synteny detection, 
segmentation, or simply colinear alignment. The earliest 
graph has been the alignment graph, formally defined for 
colinear multiple alignment by Kececioglu in 1993 [32]. 
In his definition, the graph contains a vertex for each 
sequence character and edges for aligned characters. The 
alignment graph has since been used in various versions, 
e. g., with additional sequence edges [33] and with genes 
[34] or segments [15] instead of single characters. In all 
versions, a colinear alignment can be obtained from the 
alignment graph by solving the maximum weight trace 
problem [32], but its structure also allows non-colinear 
changes to be modeled (see below). 

Pevzner et al. introduced A-Bruijn graphs [35] as a gen- 
eralization of de Bruijn graphs [36,37]. The structure of 
A-Bruijn graphs revisits an idea briefly mentioned by 
Kececioglu [32], the idea of merging aligned vertices. Con- 
sequently, A-Bruijn graphs have one vertex for sets of 
aligned positions, and edges represent sequence adjacen- 
cies. For the purpose of genome alignment with A-Bruijn 
graphs, the maximum subgraph with large girth (MSLG) 
problem [19] and the sequence modification problem 
(SMP) [38] were proposed, both targeting types of short 
cycles in A-Bruijn graphs in order to eliminate local align- 
ments that hide local colinearity. 

In the context of a pipeline for genome alignment that 
consists of the programs Enredo and Pecan [39], another 
graph has been published, the Enredo graph. The pro- 
gram Enredo applies Enredo graphs to partition genomes 
into segments. Subsequently, the program Pecan provides 
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nucleotide-level colinear alignments of segments. Enredo 
graphs have two vertices per set of aligned segments, 
a head and a tail vertex, resembling breakpoint graphs 
from rearrangement studies. The Enredo method itera- 
tively eliminates various substructures from the Enredo 
graph before deriving a final genome segmentation. 

A recent and slightly dissimilar graph is the cactus graph 
[22,40]. Cactus graphs have vertices for adjacencies and 
edges for genome segments. Their structure has two valu- 
able properties. The cactus property subdivides the graph 
(and genomes) into independent units by ensuring that 
any edge is part of at most one simple cycle [41]. These 
units assist in computing genome alignments with the 
cactus alignment filter (CAF) algorithm [22]. The sec- 
ond property is the existence of an Eulerian circuit. This 
circuit traverses all genome segments exactly once, even 
duplicated segments, conveniently providing a consensus 
genome. 

In this paper, we compare the mentioned graph-based 
genome alignment approaches with an emphasis on the 
structures of the underlying graphs. Our aim is to clarify 
similarities of the approaches and the underlying graphs 
but also to work out differences and highlight limitations. 
We realize our comparison using the same terminology 
for all graphs and by describing transformations among 
the graphs (see Figure 2). We assess the graphs in terms of 
their capabilities to display alignment information in their 
structure alone. For all graphs, substructures and modi- 
fications constitute key aspects of corresponding genome 
alignment approaches. We carefully examine substruc- 
tures as well as modifications independently from the 
particular graphs they were first described for. Founded 
on our comparison, we derive a generic framework for 
graph-based genome alignment. The framework gives an 
overview of the general graph-based approach to genome 
alignment and, hence, may assist in the development of 
future genome alignment tools. 

Results 

Terminology 

The biological term homologous denotes two or more 
genomic positions that derived from a single position 



in an ancestral genome, or two or more segments that 
derived from a single segment in an ancestral genome. 
An alignment of genomes is an assignment of positions 
from the aligned genomes. Usually, the goal is to align 
only homologous positions to each other, but since the 
ancestral genome is unknown, an alignment can only be a 
prediction of homology. 

In the following, we formally define a genomic position 
and give a very general definition of an alignment. Next, 
we define a genomic segment and constrain the align- 
ment definition to colinearity. Since colinearity is often 
too strict for predicting homology in whole genomes, 
genome aligners use so-called blocks, which are colinear 
alignments of genomic segments. Blocks can be arbitrar- 
ily combined to non-colinear genome alignments. We give 
a general definition of blocks as the basic entities that 
underlie graph-based genome aligners. Finally, we define 
the terms adjacency and breakpoint. 

Let Q be a set of genomes. Each genome g e Q is 
a sequence of characters from the DNA alphabet X = 
{A, C, G, T}. We define the position of the (i + l) st char- 
acter in genome g as p = (g,i) with 0 < i < \g\, 
where \g\ denotes the length of g. To compare two posi- 
tions with the operators < and >, we assume an arbitrary 
strict total ordering of the genomes (such that for any 
pair of genomes g l9 g 2 e Q either^ < g 2 or g 2 < gi). 
Then, pi < p 2 where pi = (gi,h) and p 2 = (g 2 ,i 2 ), if 
gi < gi or if gi = g 2 and i\ < i 2 . Let Pg = {p | p = 
(g>i)>g £ GfO < i < \g\} be the set of all positions of the 
genomes Q. An alignment component (column) A is a sub- 
set of Pg. For example, all pairs of positions connected by 
a line in Figure 1 form alignment components. Without 
any demand for optimality, an alignment A is simply a set 
of alignment components. 

An ordered pair of two positions p = (g, i) and q = (g,j) 
from the same genome g defines a segment s = (p,q) 
of length \i — j\, where min{/?, q] is the smallest position 
and max{/?, q} the position directly following the largest 
position in the segment. If p < q, the segment is in 
the forward orientation, and if p > q, the segment is 
in the reverse complemented orientation (see Figure 3). 
As an alternative to an ordered pair (p, q), a segment 
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Figure 2 Overview of transformations among four graph representations for genome alignments. Some transformations require 
information from labels (red boxes), which is not present in the graph structures (see text for details). The Cactus method keeps an Enredo-like 
graph in addition to the cactus graph. 
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Figure 3 Three segments of a genome g. Segments si = (pi , q\ ) and S3 = (p$, C73) are in the forward orientation and their sequences read ^GC 
andTCACG, respectively. Segment Sj = ipi.Qi) is in the reverse complemented orientation and reads CCTGC.Si and S3 are non-overlapping but 
not adjacent, s 2 and s 3 are non-overlapping and adjacent at position a = (g, 140). 



could equivalently be represented by a start position, a 
length, and an additional orientation bit. Two segments 
si = (pi, qi) and 52 = (p2> #2)> where without loss of gen- 
erality min{/?i, #1} < min{p2,q2}> are non- overlapping if 
max{/?i,<7i} < min{/7 2) ^2}- If max{/?i, #1} = min{p 2 ,q2}> 
s\ and 52 are adjacent and define the adjacency at position 
<2 = max{/?i, #1} (see Figure 3). Two segments fully over- 
lap if both min{/?i, q{\ = min{/?2>#2} and max{/?i, q{\ = 
max{p 2 ,q2}- 

An alignment A of a set of segments S is colinear if each 
alignment component contains at most one position from 
each segment s e S and if it is possible to impose a strict 
total ordering -< on the alignment components A e A 
(such that for any pair of distinct alignment components 
A\,A2 G A either A\ < A2 or A2 < A\) as follows: The 
relation A\ < A2 holds if for any two positions p\ G A\ 
and P2 G A2 from the same segment s e S,p\ < P2 if s is 
in the forward orientation and P2 < p\ if 5 is in the reverse 
complemented orientation. If an alignment violates the 
conditions for colinearity, it is non-colinear (see Figure 1). 
To put it simply, inversions, duplications, and transloca- 
tions of parts of the aligned sequences are non-colinear 
operations that violate colinearity. 

Non-colinear operations divide an alignment into units 
that are colinear in themselves but not with respect to 
each other. We call these units blocks and define a block 
as a colinear alignment of a set of segments. Note that a 
block may contain multiple segments of the same genome 
if duplications are present. We refer to the number of 
segments in a block as the size of a block (not to be con- 
fused with the length of segments). In Figure 1, areas 
shaded in blue and red indicate blocks. For example 
in the left alignment, the two dinucleotides CG form a 
block and the two dinucleotides AT form another block. 
In the right alignment of Figure 1, the segment ATCG 
and its reverse complement in the second sequence form 
a block. 

A block always has two equivalent representations. In 
the first block representation, some segments are in the 
forward orientation and some may be in the reverse com- 
plemented orientation. In the second block representation 
all segments are in the reverse complemented orientation 
that are in the forward orientation in the first block repre- 
sentation and all segments are in the forward orientation 



that are in the reverse complemented orientation in the 
first block representation. The essential information about 
possible inversions is the orientation of segments with 
respect to each other and not the orientation of the block 
representation. Once we choose one of the two represen- 
tations, we implicitly assign a tail and a head to a block b. 
The head is the set of positions {p} of all segments 5 e b 
with s = (p } q) } and the tail is the set of positions {q}. We 
refer to the two sets as the ends of b in cases where the 
orientation of a block is not given. 

A set of blocks constitutes a genome alignment and is 
input for building a genome alignment graph. To simplify 
the exposition of the graphs below, we define Bg as a set of 
blocks that is a tiling of the genomes Q: All pairs of blocks 
b\, b2 G Bg have to be non-overlapping; for unaligned seg- 
ments between blocks and unaligned segments at the ends 
of the genomes, Bg includes blocks of size 1. 

Two blocks b\,b2 G Bg are adjacent if there are two 
segments si e b\ and S2 G ^2 that are adjacent. Since 
all blocks have two ends, there may be up to four differ- 
ent adjacencies between two blocks: the head of b\ can be 
adjacent to the head of ^2 or to the tail of ^2, or the tail 
of b\ can be adjacent to the head of Z?2 or to the tail of 
&2- Each of the four adjacencies is defined by a set of adja- 
cency positions between segments from the two blocks, 
e. g., if the tail of b\ is adjacent to the head of ^2, the adja- 
cency is defined by the set of positions {q{\ of segments 
si G b\ with si = (p\,q\) for which there is a segment 
52 G b2 with 52 = (p2> #2) where p\ = #2- Since a block can 
contain more than one segment from the same genome, a 
block can be adjacent to itself. In the literature, adjacen- 
cies of blocks are sometimes defined as segments between 
two blocks [39] . Given that the set of blocks Bg is a tiling 
of the genomes, we can refer to an adjacency as a set of 
single positions. 

An adjacency of two blocks b\, b2 G Bg is called a break- 
point if b\ and ^2 are adjacent in only a subset of the 
segments. Then, the set of positions that define the adja- 
cency is smaller than the size of the block. More formally, 
let 5i g b\ and 52 G &2 be two adjacent segments with 
si = (p\,qi) and 52 = (p2> <?2)« Without loss of generality, 
let qi =P2- Then, b\ and ^2 define a breakpoint if there is 
a segment 5^ G b\ with 5^ = (p' v q[) for which no segment 
s f 2 G &2 with s f 2 = (p 2 > #2) ex ists where q[ = p f 2 . 
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Most commonly, genome alignment programs use pair- 
wise local alignment methods to generate blocks. Pairwise 
local alignments are blocks of size two. These blocks can 
be combined with each other to form blocks of a larger 
size (multiple local alignments) if a segment from one 
block fully overlaps with a segment from another block. 
We briefly address this preprocessing of blocks in the Dis- 
cussion section, and assume that a set Bg is given for 
constructing the graphs. 

In the literature, blocks are often referred to as synteny 
blocks or locally colinear blocks. The definitions of blocks 
differ, usually depending on the specific type of local align- 
ment method being used for generating blocks. For exam- 
ple, blocks can be defined as gapped or ungapped colinear 
alignments with or without mismatches, or simply as sin- 
gle alignment components. The graph representations are 
independent from the precise assignment of positions to 
alignment components within blocks. Only the set of seg- 
ments including their relative orientation within the block 
is relevant. For this reason, the different block definitions 
can be used interchangeably except for preprocessing the 
set of blocks to obtain Bg (see Discussion section). 

Within the graphs described in the following sections, 
blocks and adjacencies are represented by vertices or 
edges or a combination of both. For each graph, every 
genome is a (not necessarily simple) path through the 
graph. We use the term to thread for following the path of 
a genome through the graph [17,35]. 

Graphs for genome alignment 

We limit our comparison to alignment graphs, A-Bruijn 
graphs, Enredo graphs, and cactus graphs. The origi- 
nal publications of these graphs use varying terminology. 
We describe all four graphs using the same terminology, 
namely the above defined terms segment, block and adja- 
cency. Figure 4 displays an example alignment with eleven 
blocks as alignment graph, A-Bruijn graph, Enredo graph, 
and cactus graph. 

The input for building a graph is a set of non- 
overlapping blocks Bg defined on a given set of genomes 
Q. We assume the blocks to be a tiling of Q. For all four 
graphs, we define the graph structures G = (V,E) as 
ordered pairs of vertices V and edges E. In addition, we 
define graph models Mq = (G, I), which are ordered pairs 
of the respective graph structure G and a labeling function 
I. Most original publications remain vague about labels 
on vertices and edges of the graph structures. We define i 
such that the set of blocks Bg can be recovered from Mq. 

Along with the definitions of G and Mq for each of 
the four graphs, we describe how it is possible to trans- 
form the different graph structures into each other (e. g., 
an alignment graph structure into an A-Bruijn graph 
structure). A transformation is an operation that has as 
input one graph structure G and outputs another graph 



structure G f , where both G and G' represent the same 
genome alignment. If it is possible to obtain a graph struc- 
ture G' from another graph structure G without the help 
of Bg and without a labeling function, then G provides at 
least as much alignment information as G \ Moreover, if 
a graph structure G provides less information about the 
alignment than another graph structure G f , a transforma- 
tion from G to G' is ambiguous, thus, impossible. 

We examine the transformations that are depicted as 
arrows in Figure 2. Straight arrows indicate a possible 
transformation; the other arrows indicate that a trans- 
formation among the structures is impossible, which we 
prove below by providing examples for ambiguity. Never- 
theless, we describe all transformations depicted as arrows 
in Figure 2, using additional information from graph labels 
if necessary to resolve ambiguity. We define the sparse 
labeling functions t du P, i in \ or l ad i for this purpose. The 
sparse labeling functions provide sufficient information 
for the transformation but less information than i in the 
graph models. Note that a transformation among graph 
models is trivial given that Bg can be recovered from the 
model Mq of any of the four graphs. The need for labels to 
resolve ambiguity proves that the graph structures G differ 
in their information content. 

Alignment graphs. In the following section, let G = 
(V,E) be an alignment graph structure and Mq = (G,l) 
be an alignment graph model. We define i as a labeling 
function of the vertices V of G. The set of edges E = 
Ea UEb decomposes into a set of directed adjacency edges 
Ea and a set of undirected block edges With both 
directed and undirected edges, G is a mixed graph. 

The vertices V of G represent segments of the genomes. 
There is a vertex in V for every segment in the set of all 
segments (Sb = [JbeBg b) from all blocks Bg. The func- 
tion i : V -> Sb labels each vertex v e V with the 
corresponding segment s e Sb such that l(y) = s. 

Directed adjacency edges Ea (colored edges in Figure 4) 
represent adjacencies of segments. Given any pair of ver- 
tices vi,V2 e V and their labels l(v\) = (p\,q\) and 
t(yi) = (p2>42)> there is a directed edge e e Ea from v\ 
to V2 if max{/?i, q{\ = min{/?2> #2}> i- e., the segment iiyi) 
is adjacent to the segment l(y\) in Q. Adjacency edges 
thread the genomes through the alignment graph. 

Finally, undirected block edges Eb (gray edges in 
Figure 4) connect vertices labeled with segments from the 
same block b e Bg. For each vertex v\ e V with i(v\) e b, 
there are undirected edges Eb between v\ and any other 
vertex V2 £ V with l(y<i) e b. As a consequence, each 
block b e Bg forms a block edge connected component 
C c V in the alignment graph. 

The formation of connected components is important 
for recovering Bg fromMG- Let C be the set of block edge 
connected components of G. It holds V = UcgC ^ an( ^ 
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Alignment graph: 




Enredo graph: 




Figure 4 An alignment of three genomes with eleven blocks in all four graph representations. The example covers multiple non-colinear 
events: Blocks A, E, I, K are conserved in all three genomes without large structural changes. Blocks B, C, and D, as well as G, and H appear in different 
orders and orientations. Block F is missing in the red genome and J occurs twice. Colors denote the three genomes. In the alignment graph, dashed 
edges indicate the alignment of a segment with its reverse complement. We consider the information provided by line styles not to be part of the 
graph structures. In the Enredo graph, components connected by adjacency edges are shaded in gray. For the cactus graph, the figure additionally 
shows a precursor. Furthermore, enlarged vertices for the precursor and the final cactus graph show adjacencies in vertices. 



C\ fl C2 = 0 for any Ci, C2 e C. Thus, we may recover 
Bg by forming a block b = [i(v) \ v e C} for every 
component CgC. 

Our definition of the alignment graph structure G mod- 
els non-colinear changes among the input genomes, in 
particular translocations and duplications. Translocations 



appear in G as mixed cycles. A mixed cycle is a cycle in 
a mixed graph formed by both directed and undirected 
edges. Duplications appear as block edges within the set of 
vertices of one genome. Because of these edges our align- 
ment graph is not n-partite as in its original definition 
[32]. 
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Inversions are not visible in the alignment graph struc- 
ture G; the orientation of segments remains unclear (see 
also Figure 5). We define the sparse labeling function l mv : 

V -> {+, -} as 

^(v) = f + ^ < « 
[ - lip > q, 

where i(y) = (p,q). The function i inv assigns bits to the 
vertices that indicate the orientation of the represented 
segments. As an alternative to vertex labels, it is possi- 
ble to label block edges with bits that indicate equal or 
opposite orientation of the segments in the endpoints 
(visualized as dashed and solid lines in Figure 4 or red and 
black edges in [42]). 

A-Bruijn graphs. Let now G = (V,E) be an A-Bruijn 
graph structure and Mq = (G, i) be an A-Bruijn graph 
model. A-Bruijn graphs have only one type of edge E. We 
define i as a labeling function of the vertices V, In con- 
trast, the functions i inv and i dup described below provide 
labels for the edges £. 

The vertices V of G represent blocks. For every block 
in Bg, there is a vertex in V. There is only one vertex per 
block regardless of the blocks size and of duplications. 
The function i : V -> Bg labels each vertex v e V with 
the corresponding block b e Bg such that i(y) = b. With 
this labeling, recovering Bg from Mq is straightforward. 

The edges E of G represent adjacencies just like adja- 
cency edges in alignment graphs. Given any pair of ver- 
tices vi, V2 e V and their labels b\ = l(y\) and b<i = Kyi), 
there is a directed edge e e E from v\ to V2 for every 
two adjacent segments si = (p\,qi) and S2 = (p2>q2) 
with max{/?i,#i} = min{/?2> #2} where si e b\ and 




A-Bruijn graph 

Figure 5 The structure of alignment graphs and A-Bruijn graphs 
does not display inversions. Let B denote the reverse 
complemented block representation of a block B. The alignment 
graph structure and A-Bruijn graph structure for the two sequences 
ABC and ABC is the same as for the two sequences ABC and ABC. The 
orientation of blocks remains unclear in the graph structures. 
However, there are two Enredo graph structures for the alternative 
orientations of block B in one sequence. Thus, the transformation 
from the alignment graph structure or A-Bruijn graph structure to an 
Enredo graph structure is ambiguous with two alternatives. 



52 G &2« If multiple adjacent pairs of segments exist in 
b\ and b2, E contains multiple edges from v\ to V2. Thus, 
G is a multi-graph. In the present paper, we prefer the 
multi-graph representation with multiple separate edges 
between two vertices over the multi-graph representation 
with multiplicity labels on edges. 

Adjacency edges are essential for threading genomes 
through G. However, the path from threading one genome 
is not necessarily simple. It traverses vertices multiple 
times if duplications are present (see block J in Figure 4) 
making the path ambiguous. Thus, threading requires 
label information that allows incoming and outgoing 
edges of a vertex to be paired. Such information is not 
required in the alignment graph structure, where each 
vertex has at most one incoming and one outgoing edge. 
Without duplications it is sufficient to color edges of G by 
genome (red, blue, and green in Figure 4) instead of pro- 
viding the full labels L In the presence of duplications, G 
can be ambiguous even with color labels (see Figure 6 and 
block J in Figure 4). 

To resolve ambiguity of G for threading, we define the 
sparse labeling function i dup : E -> N as a total ordering 
on the edges. This function assigns numbers to the edges 
that describe the order of the adjacencies in the genomes. 
In Figure 6, for example, the edges could be numbered 1 
through 8: for genome ABDABCEBC, one edge from A to 
B would be labeled with 1, the edge from B to D would 
be labeled with 2, the edge from D to A would be labeled 
with 3, and so on; for genome ABCEBDABC, also one 
edge from A to B would be labeled with 1, but then the 
edge from B to C would be labeled with 2, and so on. To 
describe i dup formally, we use adjacency positions of the 
edges E that we determine with the help of L Given a pair 
of vertices vi, V2 e V, each edge e e E from v\ to V2 
corresponds to one pair of adjacent segments s\ e t(v\) 
and 5 2 e l(v 2 ) with 5i = (p\,q\) and 5 2 = (^2,^2). The 
adjacency position of e is a — max{/?i, q{\ = min{/?2> qi}- 
Then, i dup assigns numbers to edges in E such that 

i dup (ei) < l dup (e 2 ) 11 a\ < ci2, 




A-Bruijn graph Enredo graph 

Figure 6 Duplications may create ambiguity in the structure of 
A-Bruijn graphs and Enredo graphs. In this example, the structure 
of the A-Bruijn graph and the Enredo graph represents both the 
genomes ABDABCEBC and ABCEBDABC. Thus, the order of blocks is 
ambiguous. 
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where a\ is the adjacency position of e\, and a<i is the 
adjacency position of e<i . 

Furthermore, inversions create ambiguity in G (see 
Figure 5). Just like the alignment graph structure, G pro- 
vides no information about the orientation of segments 
represented in a vertex. We define the sparse labeling 
function i inv : E {+, -} x {+, -} for A-Bruijn graph 
edges E. For each pair of adjacent segments s\ = (p\,q\) 
and 52 = (p2> #2) from the labels of two vertices si e l(y\) 
and 52 e t(v2)i we label an edge e = (vi, V2) with 



i mv {e) 



(+, +) if pi < qi A p 2 < q2> 
(+, -) if pi < qi A p 2 > q2> 
(-, +) if pi > qi A p 2 < q2, 
(-, -) if pi > q\Ap 2 > q2- 



The first bit in the label l inv (e) indicates the orientation 
of the segment in the source vertex of e, and the second 
bit the orientation of the segment in the target vertex. It is 
not sufficient to solely label vertices of G with one orien- 
tation bit per segment of the represented block. Figure 7 
provides an example where this leads to ambiguity. 

Below, we describe transformations between A-Bruijn 
graphs and alignment graphs. As stated above, the trans- 
formation of the graph models is trivial, but the example in 
Figure 6 proves that in some cases it is impossible to trans- 
form an A-Bruijn graph structure into an alignment graph 
structure. We describe the transformation with the help of 
the sparse labeling function i dup to resolve ambiguity. 

A-Bruijn graphs from alignment graphs. To transform 
an alignment graph structure G' = (V \E' B U E A ) into 
an A-Bruijn graph structure G = (V,E), we follow 
the description of A-Bruijn graphs in [35] and exploit a 
many-to-one mapping from alignment graph vertices to 
A-Bruijn graph vertices. The transformation is possible 
without additional information from a labeling function. 

As a first step, compute all block edge connected com- 
ponents C of G '. As described above, each component 



A-Bruijn graph 



Two alternative Enredo graphs 



Figure 7 The labeling of A-Bruijn graph vertices with one 
orientation bit per segment does not resolve ambiguity. In this 
example, both blocks occur three times, twice in the forward 
orientation and once in the reverse complemented orientation. 
Combining the orientations of the segments in the two blocks is 
ambiguous as the two alternative Enredo graph structures prove. In 
the left Enredo graph structure, the segment in the reverse 
complemented orientation of one block is combined with a segment 
in the forward orientation of the other block. In the right Enredo 
graph structure, the two segments in the reverse complemented 
orientation occur consecutively. 



represents exactly one block, and each vertex v' e V is 
part of exactly one component. Now, add for every com- 
ponent C e C a vertex to the set of A-Bruijn graph vertices 
V. We obtain a many-to-one mapping of alignment graph 
vertices to A-Bruijn graph vertices. We keep the map- 
ping as a label m[v'] = v on each alignment graph vertex 
v' e V ' . The label indicates the A-Bruijn graph vertex v 
that represents the connected component containing v' . 

The remaining task is to transfer adjacency edges from 
the alignment graph to the A-Bruijn graph. Using the 
mapping, add an edge e = (u, v) to the set of A-Bruijn 
graph edges E for each edge e' = (u f , v') from the set of 
alignment graph adjacency edges E A , where u = m[u f ] and 
v = m[v f \. 

A-Bruijn graphs to alignment graphs. We describe the 
transformation of an A-Bruijn graph structure G = ( V, E) 
into an alignment graph structure G' = (V',E A U£^) given 
the labeling function i dup for the edges of G in addition 
to G. In accordance with the transformation from G' to 
G, we successively create a one-to-many mapping from A- 
Bruijn graph vertices to alignment graph vertices during 
the transformation. The mapping m labels each A-Bruijn 
graph vertex v e V with a set of alignment graph vertices 
from V' . At the beginning m[v]= {} for all v e V. At the 



end, a label m[ v] ■ 



, v\ h ,} indicates the set of align- 



ment graph vertices {v' v . . . , vj^} that form the connected 
component for a block b represented by v in the A-Bruijn 
graph. 

We transform the A-Bruijn graph by following each 
genome separately and assume that the edges are given in 
increasing order of labels: i dup {e\) < l dup (e2) < < 
i dup (e\£\). Initially, add for each genome a new vertex u' to 
the set of alignment graph vertices V' . If the source vertex 
u of the A-Bruijn graph edge e\ = (u, v) is labeled with a 
non-empty set of vertices m[u] = {u' v . . . , u f k ], add undi- 
rected edges between u' and all vertices u' v . . . , u f k to the 
set of alignment graph block edges E' B . Next, add u' to the 
set of vertices mapped to u. Repeat these three steps for 
the target vertex v of ey. add a vertex v r , add block edges, 
and add v' to the mapping. In addition, add a directed edge 
from u' to v' to the set of alignment graph adjacency edges 
V 

A' 

Iterate over the A-Bruijn graph edges in increasing order 
of labels and repeatedly add for the target vertex a new 
vertex, add block edges, add the new vertex to the map- 
ping, and add an adjacency edge from the previous to the 
new vertex. This way, the genomes are threaded through 
the A-Bruijn graph and the alignment graph structure G 
is successively built up. 

Enredo graphs. In this section, let G = (V,E) be an 
Enredo graph structure and Mq = (G, i) be an Enredo 
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graph model In an Enredo graph, the set of edges E = 
Ea UEb decomposes again into a set of directed adjacency 
edges Ea and a set of undirected block edges We define 
i as a labeling function of the block edges £5. 

The block edges Eb of G represent blocks, and vertices V 
of G represent the ends of blocks. In contrast to alignment 
graphs, a single block edge represents an entire block. For 
every block b e Bg, there are two vertices in V con- 
nected by an undirected block edge e Eb (black edges 
in Figure 4). The function i : Eb -> Bg labels each block 
edge e\t e Eb with the corresponding block b e Bg such 
that t(eb) = b. By choosing one of the two possible block 
representations as label, the two vertices that are con- 
nected by e\j are implicitly labeled as head and tail of the 
block. Note that they are not labeled as head or tail in G. 
Given the block labels, recovering Bg from Mq is again 
straightforward. 

Directed adjacency edges Ea of G (colored edges in 
Figure 4) represent adjacencies. Given any pair of block 
edges e\ = {u t , u^} and e2 = {vu Vh) an d their labels 
i{e\) = b\ and l(e<i) — b^, there is a directed edge e e Ea 
from an endpoint of e\ to an endpoint of e<i for every two 
adjacent segments si = (p\,q\) and S2 = (P2>tf2) with 
s\ e b\ and S2 e b<i< In contrast to alignment graphs and 
A-Bruijn graphs, the endpoints of adjacency edges in G 
indicate in relation to other adjacency edges the orienta- 
tion of segments in a block. Given labels, one endpoint of 
each block edge is a head vertex and the other a tail ver- 
tex. If p\ < q\, then the adjacency edge e starts at the head 
vertex u^, and ifpi > q\, e starts at the tail vertex u t . If 
P2 < q2> then e points to the tail vertex v t , and if P2 > q2> 
e points to the head vertex v^. In other words, e = (u^, v t ) 
if qi =p 2 ,e= (u h , v h ) if q\ = qi, e = (u u v t ) if pi = P2> 
and e = (u t , v^) if p\ = #2- Again, there may be several 
adjacency edges connecting the same two vertices. Thus, 
the Enredo graph is also a multi-graph. 

Due to its two-vertex concept, the structure of an 
Enredo graph G reflects the relative orientation of blocks 
as opposed to the alignment graph structure and the 
A-Bruijn graph structure (see Figure 5). G is capable 
of displaying inversions. But just like A-Bruijn graphs, 
threading a genome with duplications through G can be 
ambiguous (see Figure 6). The path from threading a 
genome through G alternates between block and adja- 
cency edges. Therefore, only multiple occurrences of a 
block in the same orientation create ambiguity in G. 

To resolve ambiguity of G, we define the sparse label- 
ing function i dup : Ea -> N as a total ordering on the 
adjacency edges. As for A-Bruijn graphs, we can use the 
labeling function i to determine the adjacency position of 
an edge e e Ea> The function l dup assigns again numbers 
to the edges Ea such that 

i du P( ei ) < i du P{e 2 ) 11 a\ < a 2) 



where a\ is the adjacency position of e\, and a 2 is the adja- 
cency positions of 62. As an example, we use again Figure 6 
with labels 1 through 8: One of the edges from the head 
of A to the tail of B would be labeled with 1; for genome 
ABDABCEBC, the edge from the head of B to the tail of D 
would be labeled with 2, and for genome ABCEBDABC, 
the edge from the head of B to the tail of C would be 
labeled with 2; and so on. 

We generalize the Enredo graph compared to its original 
definition [39] in some aspects. Enredo graphs originally 
consider blocks of size 1 as adjacencies: Instead of a block 
edge with two end vertices that are connected to the rest 
of the graph by two adjacency edges, the Enredo method 
only adds a single adjacency edge labeled with a segment. 
This requires another function Ia : Ea — > S that labels 
adjacency edges Ea with segments S. In addition, in the 
initial phase of the Enredo method segments on adjacency 
edges between the same two blocks are assumed to be 
homologous. Because of this assumption and to distin- 
guish non-homologous multi-edges later on, the Enredo 
method prefers the multi-graph representation with mul- 
tiplicity labels on one adjacency edge over multiple sepa- 
rate edges. We argue that all segments that are assumed 
to be homologous should be defined as blocks. Conse- 
quently, our description with blocks of size 1 is valid and 
even simplifies the exposition of the method. 

Furthermore, the Enredo method only adds edges for 
adjacencies that are shorter than a predefined thresh- 
old. This results already in a partial segmentation of the 
genomes bearing several segments per genome in the 
graph. Parts of the genomes may not be represented. We 
add all adjacencies to the graph and leave it to later stages 
to modify the graph. 

In the transformations below, we include the replace- 
ment of labeled adjacencies by blocks of size 1. The trans- 
formation from an Enredo graph structure to an A-Bruijn 
graph structure is possible without additional labels. The 
other direction, from A-Bruijn graphs to Enredo graphs, 
requires additional information about inversions as shown 
by the example in Figure 5. 

Enredo graphs from A-Bruijn graphs. First, we describe 
the transformation of an A-Bruijn graph structure G' = 
(V',E f ) into an Enredo graph structure G = (V.Eb^Ea) 
using the labeling function i inv : E' {+, — } x {+, — }. 
Then, we describe the transformation of blocks of size 1 
to labeled adjacency edges in the Enredo graph given full 
block information by the function i for transferring labels. 

To transform G' into G, add for each A-Bruijn graph ver- 
tex v' e V a tail vertex v t and a head vertex vj^ to the set of 
Enredo graph vertices V. Additionally, add an undirected 
edge e\j between v t and to the set of Enredo graph block 
edges £5. We obtain a one-to-one mapping of A-Bruijn 
graph vertices and Enredo graph block edges, which we 
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keep as separate labels m : V —> Eb on A-Bruijn graph 
vertices such that m[v f ] = 

Using the labeling function i inv and the mapping m, 
we can unambiguously transfer adjacency edges to the 
Enredo graph. For each edge e' = (u v') in the set of A- 
Bruijn graph edges E f where m\u'\ = e u and m\v'\ = e v , 
add an edge e = (u x , v y ) to the set of Enredo graph adja- 
cency edges Ea where u x is an endpoint of e u and v y is an 
endpoint of e v . The vertex u x is the head vertex of e u if the 
first bit ini(e') is +, and otherwise the tail vertex. The ver- 
tex v y is the tail vertex of e u if the second bit in i(e') is +, 
and otherwise the head vertex. 

In another step, we can transform all block edges e& = 
{vt> Vh\ representing blocks of size 1 into adjacency edges. 
Since the size of l(eb) is 1, the corresponding vertices v t 
and v/z are incident to exactly one adjacency edge each, 
d = (u x , v t ) and e<i = (v^, w y ). Replace such sets of two 
vertices v t , v^ and three edges e\, e<i, by a new adjacency 
edge e = (u x , w y ). Finally, transfer the label of the block 
edge l(eb) = {s} to the adjacency edge such that ^(g) = s. 

Enredo graphs to A-Bruijn graphs. We start by describ- 
ing how to recover block edges for blocks of size 1 from 
adjacency edges that are labeled with segments by I a in 
an Enredo graph structure G = (V,Eb U Ea). Afterwards, 
we describe the transformation from G to an A-Bruijn 
graph structure G' = (V',E f ), which is possible without 
additional labels. 

Replace each edge e = (u x , v y ) from the set of Enredo 
graph adjacency edges Ea, where ^(e) = s, by two ver- 
tices vt and Vh and a block edge = (v t , v^), and set 
i{e\j) = {s}. Further, add e\ = (u x , v t ) and e<i = (v^, v y ) to 
the set of Enredo graph adjacency edges Ea. 

For the transformation to an A-Bruijn graph, add for 
each edge = (v^, v t ) in the set of Enredo graph block 
edges Eb, a vertex v' to the set of A-Bruijn graph vertices 
V ' . Again, we obtain a one-to-one mapping of Enredo 
graph block edges and A-Bruijn graph vertices, which we 
keep this time as labels m \ Eb —> V' on Enredo graph 
block edges such that m\e^\ = V ' . Finally, add for each edge 
e = (u x , v y ) in the set of Enredo graph adjacency edges Ea 
where u x is incident to the block edge e u and v y is incident 
to the block edge e vi an edge e' = (u f , v') to the set of A- 
Bruijn graph vertices, where m[e u ] = u' and m[e v ] = V \ In 
this last step, we lose inversion information in the graphs 
structure. 

Cactus graphs. In this section, let G = (V, E) be a cactus 
graph structure andMG = (G, I) be a cactus graph model. 
Cactus graphs have only one type of edges. We define i 
as a labeling function of the edges E. The cactus graph 
structure G stands out from the other graph structures 
by fulfilling well-defined structural properties: Every edge 
e e E is part of at most one simple cycle, which makes G 



a cactus graph [41], and G has an Eulerian circuit [43]. A 
number of construction steps guarantee these properties. 

Let A be the set of all adjacencies of segments. The ver- 
tices V of G partition A into a set of pairwisely disjoint 
subsets Q: Each element v e Q is a subset of A, fi fl v = 0 
for any two sets /x, v e Q, and UvgA v — A. For each sub- 
set v e Q, there is a vertex in V. In addition, G has one 
distinct vertex, the origin, that represents the start and end 
of all genomes 0. In Figure 4, for example, the vertex 0 
represents the start and end, the vertex a U f$ represents 
a subset of twelve adjacencies, the vertex y represents a 
subset of eleven adjacencies, and the vertex 8 represents 
a subset of seven adjacencies. In the vertex <5, for exam- 
ple, three of the adjacencies are from the red genome, two 
from the blue genome, and another two from the green 
genome as shown in the enlarged vertices. The subsets 
correspond to subgraphs of the Enredo graph (shaded in 
gray in Figure 4). All adjacencies at one end of a block are 
always part of the same subset. We describe the details on 
how to determine the subsets below in the transformation 
from Enredo graphs. 

The edges E of G represent blocks just like block edges 
in Enredo graphs. For each block b e Bg, there is an undi- 
rected edge e = {u, v} in the set of cactus graph edges E 
(black lines in Figure 4). The endpoint u e V of e rep- 
resents a subset of adjacencies fi e Q that contains all 
adjacencies at one end of b, and the other endpoint v e V 
represents v e Q that contains all adjacencies at the other 
end of b. It is possible that u = v. The function i : E —> Bg 
labels each block edge e e E with the corresponding block 
b e Bg such that 1(e) = b. With this labeling, recovering 
Bg from Mq is again straightforward. 

The cactus graph has no directed edges as found in 
other graphs. Since vertices of G represent segment adja- 
cencies in sets, the size of blocks and the number and 
precise set of adjacencies remain unclear in the struc- 
ture. Recovering this information from G is impossible as 
the following examples from Figure 4 demonstrate: The 
cactus graph structure does not tell how many genomes 
traverse block F and whether block I and K are adjacent in 
one of the genomes or not. 

Still, each genome corresponds to a (not necessarily 
simple) path through G. With the help of labels we can 
recover this path. The colored lines in the enlarged ver- 
tices in Figure 4 provide the equivalent information as col- 
ored adjacency edges in Enredo graphs and would resolve 
ambiguity for threading if no duplications were present. 
More information is necessary to resolve all ambiguity. 
We suggest l ad i : E -> 2 {+ '- }xN , where 2 X denotes the 
power set of a set X, to label the edges with lists of pairs 
of an orientation bit and a positive number. The lists have 
an entry for each segment of the blocks b = 1(e). The 
orientation bits are necessary to determine the relative 
orientation of segments within blocks that are represented 
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by edges e = (v, v) (see blocks B, C, D, F, G, H, and J in 
Figure 4). The numbers impose a strict total ordering < 
on all genome segments si, S2 G S where si -< S2 if 5i is left 
ofs 2 . 

Cactus graphs are not as independently used as the 
other genome alignment graphs. The cactus method oper- 
ates on two graphs, the cactus graph and another graph 
called the adjacency graph [40]. Interestingly, the lat- 
ter has the same structure as an Enredo graph. We 
view the cactus graph, which enables the characterization 
and detection of new substructures, as a supergraph on 
top of the Enredo graph. The transformation of Enredo 
graph structures to cactus graph structures conforms 
with the construction of a cactus graph [22,40] and does 
not require additional labels. The transformation back 
to Enredo graphs is ambiguous as the above mentioned 
examples from Figure 4 show. For this reason, our descrip- 
tion of this transformation uses the sparse labeling l ad i in 
addition to the graph structure. 

Cactus graphs from Enredo graphs. To transform an 
Enredo graph structure G' — (V',E B U E' A ) into a cac- 
tus graph structure G = (V,E), we follow three steps 
described in [22,40]. First, we transform the Enredo graph 
into a precursor cactus graph. The second and third steps 
modify the precursor to ensure the structural properties 
of cactus graphs. The second step guarantees that every 
edge is part of at most one simple cycle. After the third 
step, the graph is Eulerian. Throughout all steps, we make 
use of a many-to-one mapping m : V —> V from Enredo 
graph vertices V' to cactus graph vertices V, which labels 
each Enredo graph vertex V e V' with a cactus graph 
vertex v e V such that m\y'\ = v. 

First, compute all adjacency -edge connected components 
Ca in the Enredo graph structure G \ Each component 
C £ Ca is a subset of the vertices V . For each C e Ca> 
add a vertex to the set of cactus graph vertices V. Assum- 
ing that the start and end of all genomes are connected, 
add only one origin vertex for all of them to V. We obtain 
the many-to-one mapping that indicates the cactus vertex 
representing the adjacency edge connected component of 
any Enredo graph vertex. Given this mapping, transfer the 
Enredo graph block edges Eb to the cactus graph: For each 
edge e' — {u V} in the set of Enredo graph block edges 
Eb, where m\u'\ — u and m[V] = v, add an edge e = {u, v} 
to the set of cactus graph edges £. It is possible that u = v 
even if u! ^ v' . This yields the precursor cactus graph in 
Figure 4. 

In the second step, remove sets of vertices from V that 
are 3-edge-connected and add instead a single vertex v to 
V (vertices a and /3 in Figure 4). Correct the mapping m 
and redirect block edges that were incident to any ver- 
tex in the 3-edge connected component, to be incident 
to v. 



Finally, replace connected components formed only by 
edges whose removal disconnect the graph (not present 
in Figure 4). Each such component is a tree with leaf and 
branching vertices vi, . . . , v c . Remove v\, . . . , v c and add 
instead a new vertex v to V. Just as before, correct the 
mapping m and redirect incident block edges to v. 

Cactus graphs to Enredo graphs. In the transformation 
from a cactus graph structure G = (V,E) to an Enredo 
graph structure G' = (V',E' B U E' A ), we use the labels l ad i 
to separate the sets of adjacencies represented by cactus 
graph vertices V and to add edges E' A that represent sin- 
gle adjacencies to the Enredo graph structure. In addition, 
we create a one-to-one edge mapping m \ E —> E' B that 
labels each cactus graph edge e e E with an Enredo graph 
block edge e' h e E' B where e' h — {u' ', v'}. In this mapping, 
we store a direction of each Enredo graph block edge and 
distinguish the tail vertex u' from the head vertex v' such 
that m[e] = (u f ,v f ). The direction is not present in the 
Enredo graph structure G r . The transformation proceeds 
by threading the genomes Q through G. 

Initially, identify among all cactus graph edges incident 
to the origin vertex u e V the edge eo = {u, v} whose 
label contains the smallest number no e l ad i(eo) where 
no < n and n e i ad \e) with e = {u,x}. Add two vertices 
u' and v' to the set of Enredo graph vertices V' and an 
edge e' h — {u' ', V} to the set of Enredo graph block edges 
E B . Update the mapping such that m[eo\ = (u v r ). Keep a 
reference to v' for the next step. 

Among all edges incident to v, identify the next edge 
e\ = (v, w) whose label contains the next larger number 
n\ e l ad j(e\) such that n\ > no but n\ < n where n ^ no 
and n e l ad i(e) with e = {v,x\. If the mapping for e\ is 
undefined, add two vertices v" and w' to the set of Enredo 
graph vertices V and an edge e' h = {V 1 W} to the set of 
Enredo graph block edges E' B . Update the mapping such 
that m[e\] = (v", W). Further, add an edge e' a = (v r , v ff ) to 
the set of Enredo graph adjacency edges Ea and keep W 
for the next step. If the mapping for e\ is already defined 
with m[e\]= (v /r , W), only add an adjacency edge: If the 
orientation bit in t ad i{e\) is +, add the edge e' a — (v 1 v") 
to the set of Enredo graph adjacency edges Ea and keep W 
for the next step. If the orientation bit is — , add an edge 
e' a = (v r , W) and keep v" for the next step. 

Next, repeat the same for incident edges of w. Proceed 
like this until reaching the end of all genomes to obtain the 
full Enredo graph structure G \ 

All in all, the need for labels shows that the four 
graphs markedly differ in the information represented in 
their structures. Complete duplication information (dup) 
is only present in alignment graph structures, and only 
the structure of Enredo graphs reveals inversion informa- 
tion (inv). A-Bruijn graphs are a compact and intuitive 
representation but lack both inversion and duplication 
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information. Finally, cactus graph structures do not rep- 
resent parts of the adjacency information (adj). Despite 
these structural differences, all graph models, which 
include labels, can be transformed into each other. 

Based on these observations, some advantages or dis- 
advantages of the graph structures become apparent. For 
example, for a genome aligner intended to reveal inver- 
sions, an Enredo graph structure appears to be more 
suitable, whereas a more general analysis of the genetic 
content of genomes will work well with the more compact 
A-Bruijn graph structure. Duplications are best visible in 
an alignment graph structure. The advantage and infor- 
mation provided by cactus graphs subdivides genomes 
into independent regions revealing specific and unique 
substructures as described in the following. 

Graph substructures 

We collected substructures from graph-based genome 
alignment approaches and classify them here into four 
types: colinear paths, visiting blocks, short cycles, and 
cactus groups. Substructures are useful for deriving 
a meaningful genome segmentation or they indicate 
misalignment, i.e., the alignment of non-homologous 
segments. Furthermore, they pinpoint parts of genome 
alignments that can be improved through modification. 

Some substructures have been described for several 
graph-based approaches, while others are unique to only 
one approach. We conjecture that it is possible to identify 
all substructures in all graph models. If the time com- 
plexity for detecting occurrences of the substructures was 
the same in all graphs models, they could be used inter- 
changeably. Here, our aim is to analyze abilities of the 
graph structures to reveal potential misalignments with- 
out additional information from labels. 

Colinear paths. We refer to the first type of substruc- 
tures as colinear paths, Colinear paths are sets of blocks 
that appear in one or more genomes consecutively in the 
same orientation and without breakpoints in between. A 
sequence of blocks b\, . . . , forms a colinear path if there 
is an adjacency but no breakpoint between b{ and for 
all i = 1, . . . , k — 1. Consequently, all blocks along a colin- 
ear path have the same size and there are segments si e b\ 
and Sk e with si = (p\,q\) and s^ = (p^ q^) such that 
s = (pi,qk) is a consecutive genome segment that con- 
catenates one segment from each block b\, . . . , b^. We also 
consider a single block as a colinear path. 

A colinear path is maximal if it cannot be further 
extended by other adjacent blocks, but is bounded by 
breakpoints. Note that alignment modifications often 
remove bounding breakpoints such that a colinear path 
can again be further extended. The set of maximal col- 
inear paths of a genome alignment determines the final 
genome segmentation. Independent from the underlying 



graph structure, all graph-based genome alignment meth- 
ods have the common aim to maximize colinear paths 
both in terms of size (number of genome segments) and 
length (total segment lengths). 

In Enredo graphs, simple non-branching paths are col- 
inear paths. Similarly, colinear paths appear as non- 
branching paths in the A-Bruijn graph structure, but here 
a non-branching path is not necessarily a colinear path. 
Along a non-branching path in A-Bruijn graphs, one or 
more blocks can be inverted in a subset of the genomes. 
The structure of A-Bruijn graphs provides no information 
about inversions (see Figure 5). Thus, to detect colin- 
ear paths in A-Bruijn graphs, information from labels is 
necessary. Only a single vertex is detectable as (not neces- 
sarily maximal) colinear path in the structure of A-Bruijn 
graphs. The same holds for the alignment graph structure: 
The detection of consecutive blocks is straightforward, 
but in order to avoid the inclusion of inverted blocks 
that break colinearity, additional information about inver- 
sions is necessary. And finally, colinear paths appear in the 
cactus graph structure as non-branching paths although 
non-branching paths are not necessarily colinear paths. 

Visiting blocks. We name the second type of substruc- 
ture visiting block, which conceptually is a special type 
of a maximal colinear path. A maximal colinear path 
{b\, . . . , bfc] is a visiting block if there is a block bo adjacent 
to b\ and a block bj <+ \ adjacent to b^ with the following 
two symmetric conditions (without loss of generality, we 
assume that the tail of bo is adjacent to the head of b\ and 
the tail of b^ is adjacent to the head of bj <+ i): For all seg- 
ments s = (pi,qk) of the colinear path that are adjacent 
at position p\ to a segment so £ bo> there is a segment 
e bj <+ \ adjacent to s at position q^; and for all seg- 
ments s = (pi,qk) that are adjacent at position q^ to a 
segment s^+i e bk+i, there is a segment so £ adjacent 
to s at position pi. The important property is that all seg- 
ments from block bo that are adjacent to segments of the 
colinear path, continue in the same block bj <+ \ at the other 
end of the colinear path and vice versa. 

A visiting block arises from merging blocks from within 
a colinear path with other blocks. If the merged blocks 
are short, they often only have spurious similarity. Hence, 
they break colinearity at two positions without provid- 
ing much evidence for a large structural change. This is 
a reason why genome aligners address visiting blocks and 
separate the otherwise colinear paths. 

In A-Bruijn and Enredo graphs, visiting blocks appear 
as simple non-branching paths bounded by branching ver- 
tices. In Enredo graphs, the path always starts and ends 
with a block edge. In both graphs, at least one branch 
which enters the visiting block at one end must be formed 
by a set of segments that leaves the visiting block as its 
own separate branch at the other end (see Figure 8). This 
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Figure 8 Visiting blocks are not distinguishable in the structure of A-Bruijn graphs and in the structure of Enredo graphs. In this example, 
only the colors of adjacency edges reflect a difference between the substructure at the top (visiting block) and at the bottom (no visiting block). We 
consider colors as edge labels, which are not present in the graph structures. Thus, visiting blocks do not form unique substructures in the structures 
of A-Bruijn and Enredo graphs. 



condition makes it impossible to identify visiting blocks 
in the A-Bruijn and Enredo graph structures. Likewise, 
the structure of cactus graphs alone does not reveal vis- 
iting blocks. Only in the structure of alignment graphs, it 
is possible to determine whether a given colinear path is a 
visiting block or not. 

Visiting blocks have been described for A-Bruijn graphs 
as microblocks [38] and also for Enredo graphs both 
implicitly in the "joining" operation and explicitly as a 
first type of "aberrant homologies" [39] . Furthermore, we 
view another type of "aberrant homologies" from Enredo 
graphs as a special case of this substructure: retrotrans- 
posed pseudogenes that cause a series of successive visit- 
ing blocks. 

Short cycles. Cycles in genome alignments are indi- 
cators for rearrangement. A change in one of two 
identical genomes often introduces a cycle in the corre- 
sponding genome alignment. In the same way, spurious 
similarity causes cycles and breaks colinearity. If there 
are many cycles, they often hide significant colinearity. 
For this reason, many genome aligners eliminate short 
cycles. 

Specific types of cycles also play a role for colinear 
sequence alignment. For example, alignment graphs with- 
out mixed cycles are colinear alignments [33]. Thus, we 
can compute colinear alignments by eliminating mixed 
cycles from alignment graphs. Similarly, the partial order 
alignment (POA) program [44] uses directed acyclic 
graphs (DAGs) for alignment representation, essentially 
A-Bruijn graphs without directed cycles. 

We define a genome alignment cycle as a sequence of 
blocks bi,...,bk where block b{ is adjacent to block bi+\ 
for all i = 1, 1 and b^ is adjacent to b\. Fur- 

ther, we require all sets of positions that define adjacencies 
between two blocks b{ and fej+i along the cycle to be dis- 
joint. Thereby we exclude pairs of adjacent blocks from 



the set of genome alignment cycles. A cycle is short if the 
total length of segments along the cycle is below a given 
length threshold. 

The definition of genome alignment cycles corresponds 
to simple mixed cycles in the Enredo graph structure. 
They mostly appear in the A-Bruijn graph structure and 
alignment graph structure as (mixed) simple cycles, too, 
but there is no one-to-one correspondence: The align- 
ment graph structure can have more than one cycle for 
a single genome alignment cycle (see Figure 9A); and 
genome alignment cycles that are caused by inversions 
are not visible in the alignment graph structure and 
A-Bruijn graph structure. Figure 9B shows an example 
for two genome alignment cycles that appear as a sin- 
gle cycle in the A-Bruijn graph structure. Despite these 
essential differences, cycles in the alignment graph struc- 
ture, A-Bruijn graph structure, and Enredo graph struc- 
ture always correspond to genome alignment cycles as 
opposed to cycles in the cactus graph structure. Subgraphs 
in the structures of alignment graphs, A-Bruijn graphs, 
and Enredo graphs that correspond to cycles in the cac- 
tus graph structure are not even necessarily connected 
(see below). 

In the following, we discuss two characteristics for dis- 
criminating between different types of cycles, the orien- 
tation of adjacencies and the number of maximal colinear 
paths. Next, we briefly address the special case of palin- 
dromes. In addition, we describe how simple cycles in 
cactus graphs are used as characteristic substructures 
although they differ from genome alignment cycles. 

Orientation of adjacencies, A-Bruijn graphs represent 
adjacencies as directed edges. This allows classifying 
cycles into those that follow the direction of edges and 
those that ignore the direction of edges. Pevzner and 
colleagues refer to the two types of cycles as whirls and 
bulges [19,35]. Whirls are directed, and bulges ignore the 
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Figure 9 Genome alignment cycles have a one-to-one correspondence only in the structure of Enredo graphs. (A) A cycle in the Enredo 
graph structure may correspond to several overlapping cycles in the alignment graph structure. In this example, two cycles in the alignment graph 
structure are shaded in red and blue. (B) Cycles caused by inversions appear only in the Enredo graph structure. In this example, the upper cycle in 
the Enredo graph structure is due to an inversion in block A, hence, does not appear in the A-Bruijn graph structure. 



direction of edges. The graph-based genome aligner ABA 
addresses whirls and bulges in A-Bruijn graphs [19]. 

The classification of cycles in whirls and bulges becomes 
ambiguous when the graph represents multiple genomes. 
It depends on the initially chosen relative orientations of 
the genomes. If we invert a subset of the genomes, some 
whirls become bulges and some bulges become whirls 
(see Figure 10). Note that whirls and bulges have been 
first introduced for repeat resolution within one genome 
[35], where the classification in whirls and bulges is 
unambiguous. 

Number of maximal colinear paths, A genome align- 
ment cycle is formed by complete maximal colinear paths 
and possibly single additional blocks. For example, the 
cycle in Figure 10 is formed by the maximal colinear path 
consisting of the single block B and two additional blocks 
A and C. A and C may be part of longer maximal colinear 



paths. In contrast to the orientation of adjacencies, the 
number of maximal colinear paths classifies the cycles 
unambiguously [38]. 

The A-Bruijn graph based approach DRIMM-Synteny 
[38] uses a classification of cycles into one-way, two-way, 
and composite cycles, which is similar but not equivalent 
to a classification according to the number of maximal 
colinear paths. DRIMM-Synteny focuses only on one- 
way and two-way cycles even though there can be cycles 
formed by more than two paths. The "annealing" oper- 
ation in Enredo [39] places special emphasis on cycles 
formed by two maximal colinear paths after each of these 
paths has been joined to a single adjacency edge. In addi- 
tion, Enredo addresses all other cycles as the third type of 
"aberrant homologies". 

Palindromes, Palindromes in genomes are inverted tan- 
dem duplications. Hence, they traverse a duplicated block 




Figure 10 The classification of cycles into whirls and bulges depends on the orientation of genomes. In this example, inversion of the red 
genome transforms the cycle from bulge to whirl or vice versa. 
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twice and in both directions. Palindromes create a special 
type of cycles in genome alignments formed by only one 
adjacency at one end of a block. For the detection of palin- 
dromes and distinction against tandem repeats, inversion 
information is necessary. Thus, the structure of alignment 
graphs and A-Bruijn graphs alone cannot reveal palin- 
dromes. In Enredo graphs, we recognize palindromes by 
an adjacency edge loop (see Figure 11). Palindromes are 
separately addressed as "thorns" in A-Bruin graphs [38] 
and mentioned as "aberrant homologies" in Enredo [39]. 

Cactus chains. In cactus graphs, simple cycles are 
named chains [22] . The corresponding subgraphs of cac- 
tus chains in Enredo graphs, A-Bruijn graphs, and align- 
ment graphs are not necessarily connected (see blocks 
A, E, I, K in Figure 4). But even though chains do not 
correspond to continuous segments of genomes, they rep- 
resent conserved orders of blocks (e. g., blocks A, £, /, K 
in Figure 4 appear in this order in all genomes). Cactus 
chains are addressed by the Cactus method. 

It is possible to identify the subset of blocks forming 
a cactus chain in the Enredo graph structure, for the 
simple reason that an Enredo graph can be transformed 
into a cactus graph. However, it appears impossible to 
characterize chains in Enredo graphs without comput- 
ing e. g., 3-edge connected components. In the structure 
of alignment and A-Bruijn graphs, information about the 
orientation of adjacent blocks is missing for identifying 
cactus chains. 

Cactus groups. Paten et al. refer to adjacency edge con- 
nected components, which are computed for constructing 
a cactus graph, as groups [22]. A cactus group is a set 
of adjacencies that forms an adjacency-edge connected 
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Figure 1 1 Palindromes and tandem repeats are not 
distinguishable in the structure of alignment and A-Bruijn 
graphs. Depending on the relative orientation of the segments in 
block A, the displayed alignment graph structure and A-Bruijn graph 
structure represents a palindrome or a tandem repeat. In Enredo 
graphs, palindromes and tandem repeats form distinct substructures. 



component in the Enredo graph. All adjacencies of one 
group are represented by one vertex in the cactus graph 
structure, but a cactus graph vertex can represent several 
groups. Strictly speaking, cactus groups are visible in the 
structure of Enredo graphs but not in the structure of cac- 
tus graphs. Similarly, it is not possible to recognize groups 
in the alignment graph and A-Bruijn graph structures 
because this requires information about the orientation of 
adjacent blocks. 

In summary, inversion and duplication information is 
necessary for the complete detection of all substructures. 
Visiting blocks require duplication information, and all 
other substructures require inversion information. Hence, 
none of the four graphs reveals all substructures solely by 
its structure. 

This concludes our classification of substructures on the 
basis of a not necessarily exhaustive list of substructures. 
Identification of further substructures or an assessment 
of their relevance for the accuracy of genome alignments 
may possibly point towards another way of classifying them. 

Modifications 

Graph-based genome aligners modify the genome align- 
ments by eliminating substructures from the graphs. The 
aim is to reveal long conserved homologies, i. e., blocks 
of large size and length. As mentioned in the introduc- 
tion, genome alignment comprises selection of local align- 
ments and segmentation. Here, we describe modifications 
that eliminate substructures either by modifying the set of 
local alignments represented in blocks ("splitting blocks" 
and "merging parallel blocks") or by determining break- 
point positions that will be part of the final segmentation 
("merging consecutive blocks" and "cutting adjacencies"). 

These four modifications derive from the mentioned 
graph-based genome alignment approaches, but they 
match the operations described for the approaches only in 
part. Some genome alignment approaches clearly separate 
block modification and segmentation, other approaches 
do both tasks together. Similarly, some approaches apply 
compound operations consisting of several of the modifi- 
cations described here. Our intention is to provide small 
modification entities from which it is possible to assemble 
more complex operations. 

We describe every modification on the set of blocks 
(not on the level of alignment components but on the 
level of segments). Furthermore, we mention effects of 
the modifications in the graph structures although they 
can be applied to a genome alignment independently 
from a graph structure. We explain how these modi- 
fications correspond to operations in the graph-based 
genome alignment approaches, especially if the corre- 
spondence is not obvious. For example, this is the case for 
DRIMM-Synteny [38], which solves the sequence modifi- 
cation problem (SMP) on A-Bruijn graphs. The method 
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Figure 12 The effect of splitting a block on the graphs. In 

alignment graphs, removal of block edges splits a block if the removal 
disconnects a block edge connected component. In A-Bruijn graphs 
and Enredo graphs, vertices (and block edges) need to be multiplied. 
In cactus graphs, the effect depends on the context. We show only 
the simplest possibility, where an edge is multiplied. 



modifies the sequences and determines the segmenta- 
tion on the modified sequences before transforming the 
sequences back. We transfer the effects directly to the 
original sequences and set of blocks, and refer to the 
modifications accordingly. The four modifications cover 
all operations described in the programs ABA, DRIMM- 
Synteny, Enredo, and Cactus. 

Splitting blocks. The most prevalent modification is 
splitting a block by dividing its set of segments into sub- 
sets that form new smaller blocks. Formally, the modifi- 
cation replaces a block b = {si,...,s n }, where n > 2 
is the size of b, by two blocks b\ = and 
&2 = {sk+i> • • ->Sn}> 1 < k < n. The new blocks may 
have size 1, thus may consist of a single segment. Trans- 
ferred to the original set of local alignments from which 
the blocks were formed, this modification corresponds to 
removing local alignments. In some cases, it is enough to 
remove a single pairwise local alignment to split a block 
into two blocks. In other cases, a particular subset of the 
local alignments needs to be removed simultaneously. 

Splitting blocks has different effects on the genome 
alignment graphs (see also Figure 12). In the alignment 
graph structure, the splitting corresponds to removing all 
edges between two vertex subsets of a block edge con- 
nected component. In A-Bruijn graph structures, where 
vertices represent blocks, the modification replaces a ver- 
tex by two new vertices; incoming and outgoing edges are 
connected to the respective new vertex. The effect of split- 
ting blocks in Enredo graph structures is very similar: The 
modification duplicates a pair of head vertex and tail ver- 
tex connected by a block edge, and reconnects incoming 
and outgoing adjacency edges accordingly. In cactus graph 
structures, the splitting of block edges can lead to com- 
plex rearrangements with both splitting and merging of 
vertices. 

Graph-based genome aligners eliminate many substruc- 
tures using this modification. By splitting blocks, we can 
clearly modify a graph so as to eliminate visiting blocks 
(see Figure 8), which is done in the programs DRIMM- 
Synteny and Enredo. While Enredo splits a block into 
two blocks of arbitrary size, DRIMM-Synteny splits sin- 
gle segments from a block, thus creating blocks of size 1. 
Additionally, this modification can eliminate small cycles, 
e. g., whirls in A-Bruijn graphs [35] and mixed cycles in 
alignment graphs. Further, the sequence modifications 
in DRIMM-Synteny for one-path cycles and palindromes 
result in the splitting of segments from blocks. Finally, the 
"melting" operation in the Cactus method splits all blocks 
along a cactus chain into blocks of size 1. 

Merging parallel blocks. The opposite to block split- 
ting is a modification that merges blocks by adding local 
alignments between segments of the blocks. To merge 



two blocks b\ = . . . ,s/J and b<i = {s^+i* • • -> s n} of 
size k and size n — k, respectively, into a new block b = 
{si,...,s„} of size n, it is sufficient to add a local align- 
ment of two segments s; e b\ and sj e Such local 
alignments can be new or previously removed by split- 
ting blocks. Note that merging of parallel blocks implicitly 
aligns all segments of the two blocks. 

The effect on the graph structures is the reverse of block 
splitting. In the alignment graph structure, it corresponds 
to adding block edges. In the A-Bruijn graph structure 
two vertices are replaced by a single vertex. In the Enredo 
graph structure, two block edges with head and tail ver- 
tices are being replaced by a single block edge with one 
head and one tail vertex. In the cactus graph structure, 
merging of parallel blocks can lead to complex rearrange- 
ments just as splitting of blocks. The result is typically a 
longer chain or a new sub-cactus. 

Graph-based genome alignment approaches usually 
merge blocks based on the structure of surrounding 
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blocks. Two-way cycles and bulges in A-Bruijn graphs and 
Enredo graphs are substructures that suggest to merge 
parallel blocks [38,39]. Furthermore, the genome seg- 
ments within cactus groups are more likely to be homolo- 
gous than others, hence, subject to merging [22]. Both in 
Enredo and in cactus graphs, the modification is termed 
"annealing". 

Merging consecutive blocks. The preceding two mod- 
ifications often generate new or longer colinear paths. It 
is possible to replace the consecutive blocks of a colinear 
path by a new longer block that rules out the possibility 
of a breakpoint between the merged blocks. The mod- 
ification replaces two adjacent blocks b\ = {si, . . .,s n ] 
and &2 = {s' v . . .,s' n } without breakpoint in between by a 
longer block b that is formed by the concatenation of all 
adjacent segments s; and s f t where i = l,...,n. Merging 
consecutive blocks does not directly affect the alignment 
of the genomes, but simplifies the graph structures and 
also genome segmentation. 

The effects on the graphs are straightforward. In the 
alignment graph structure, a single vertex replaces each 
pair of vertices in two adjacent block edge connected 
components. In the A-Bruijn graph structure, one vertex 
replaces two consecutive vertices. In the Enredo graph 
structure, one block edge replaces a path consisting of a 
block edge, adjacency edge, and another block edge. And 
similarly in the cactus graph structure, one block edge 
replaces a path of a block edge, a vertex, and another block 
edge, thereby reducing the number of vertices in a chain. 

Merging consecutive blocks is part of the "joining" oper- 
ation in the Enredo method [39]. The other approaches do 
not apply this modification. 

Cutting adjacencies. As opposed to merging consecu- 
tive blocks, the last modification fixes a breakpoint in the 
genome alignment by cutting genomes into several seg- 
ments. For example, given a block b = {s\, . . .s n ] with 
si = (pi, qi) where i = 1, . . . , n and with a breakpoint 
at the tail of b, the modification cuts the genomes at all 
positions The modification does not affect the set of 
blocks but rather the set of genomes. Thus, it is part of the 
genome segmentation process. 

Cutting adjacencies corresponds to removing a single 
edge from an A-Bruijn graph structure, a single adjacency 
edge from an Enredo graph structure, or a set of adjacency 
edges from an alignment graph structure. Again, there 
are multiple possible effects in a cactus graph structure. 
In the simplest case, the cactus graph structure remains 
unchanged. In all graphs, the removal of edges can discon- 
nect the graph structures, generating several components 
that correspond to disjoint sets of genome segments. 
Thus, it can become impossible to thread the genomes 
through the graphs without additional effort [35,38]. 



Cutting adjacencies is used in various ways by genome 
alignment approaches. The ABA method cuts adjacen- 
cies for eliminating bulges from A-Bruijn graphs and 
the Enredo method for eliminating small cycles in gen- 
eral. In addition, the segmentation processes in A-Bruijn 
and Enredo graphs implicitly use this modification: In 
DRIMM-Synteny, segmentation is realized by coloring the 
graph. In Enredo, it is realized by excluding adjacencies 
shorter than a given length threshold. Genome segmenta- 
tion in alignment graphs and cactus graphs has not been 
described explicitly. 

Discussion and conclusions 

We compared four graph data structures and their usage 
for genome alignment. Our comparison identified that 
essential pieces of information about duplication and 
inversion are only present in the structures of some 
graphs. In addition, we examined substructures in the 
graph structures that are subject to elimination in vari- 
ous genome alignment approaches, and determined four 
classes of substructures. We found that information about 
duplications or information about inversions or even 
both are necessary for distinguishing any type of sub- 
structure in the graphs. Thus, it is indeed essential to 
keep additional information in labels of the vertices or 
edges, though the different graphs depend on the labels 
to a lesser or greater extent. Finally, we reduced the set 
of operations applied for eliminating substructures from 
the graphs to four elementary modifications. Overall, it 
became apparent that many ideas are shared by all graph- 
based approaches. 

These shared ideas allow us to derive a framework for 
graph-based genome alignment (see also Figure 13), an 
ABC to G-enome alignment. It begins with the com- 
putation of local colinear alignments among the input 
genomes (A). The choice of the local alignment method 
is mostly independent from the following steps though 
it influences the resulting genome alignments. Combin- 
ing local alignments to blocks, we can build a graph 
(B). Which graph to choose depends on the respective 
importance of different substructures for an application. 
Next, a graph-based genome alignment approach always 
characterizes a set of graph substructures (C). Substruc- 
tures sometimes have equivalences in other graphs, but 
may as well be distinguishable in the structure of only 
one graph. Detection of all substructure occurrences (D) 
is a requirement for their subsequent elimination (E). 
Elimination is accomplished by modifying the underly- 
ing set of blocks and sometimes also by introducing 
breakpoints in the genomes. The breakpoints determine 
already parts of a genome segmentation, which is fin- 
ished in a last step (F). The segmentation together 
with the modified set of blocks defines the genome 
alignment (G). 
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Figure 13 An ABC to graph-based genome alignment. These six 
steps lead to the genome alignment in all of the compared 
graph-based genome alignment approaches. 



This framework describes the main procedure of graph- 
based genome alignment. Still, it has limitations and there 
are additional problems to be solved. One such problem 
addresses blocks and occurs before building a graph. If 
we do not break up the local colinear alignments into 
alignment components, blocks may in general partially 
overlap. It is possible to resolve overlapping blocks by 
trimming [45] or avoid overlaps by requiring local align- 
ments to be sparse [39]. A good alternative, which is for 
example used by the genome aligner Mugsy [21], is to 
obtain a set of mutually disjoint blocks by refining seg- 
ment matches [46] . A segment match refinement resolves 
overlaps through modest computation without losing any 
alignment information. 

In addition, the generation of blocks (multiple align- 
ments) from pairwise alignments may pose a problem. 
There are only few exceptions of genome aligners that 
avoid the problem by directly computing local multiple 
alignments [45,47] . If we assume transitivity of the align- 
ment relation, it is straightforward to go from local pair- 
wise alignments to alignment components or to multiple 
ungapped alignments. In the case of gapped alignments 
however, pairwise alignments can have conflicting gap 
patterns. This complicates the task of combining them 
to a single block. Heuristic methods such as progressive 
alignment [7] or transitive alignment [9] carry out this 
task, but are time consuming. Having said that, a colinear 



realignment for each block carried out after finishing seg- 
mentation has proven to significantly improve alignment 
accuracy [45,48]. This suggests the alternative to ignore 
gaps in blocks while improving the genome alignment on 
the level of blocks. 

Further, we have not covered all aspects of the frame- 
work in this paper and left out details on the detection 
of substructures. For example, ABA and DRIMM-Synteny 
detect small cycles by efficiently computing a maximum 
spanning tree before heuristically inspecting the remain- 
ing edges that create cycles. Different detection methods 
clearly have an influence on the time complexity of an 
approach and, depending on their sensitivity, also on the 
accuracy of a genome aligner. Thus, a thorough analysis of 
detection methods is certainly interesting but beyond the 
scope of this work. 

Similarly, we have not addressed algorithms for elim- 
inating substructures. These algorithms determine the 
order in which modifications are applied. The elimination 
of one type of substructures can create other substruc- 
tures, which again can create the first type of substruc- 
tures upon elimination. For this reason, iterative elim- 
ination strategies are prevalent in graph-based genome 
aligners. End criteria for iteration are typically given as 
parameters of the method, e. g., a maximal length of cycles 
or an explicit number of iterations. 

The parameters usually require customized values for 
every new input set of genomes. Usually, this inhibits 
broad usage of tools if automatic parameter selection is 
not offered. A genome aligner has to find a trade-off 
between size and length of blocks. Very similar genomes 
will have long blocks conserved across many genomes, 
whereas more diverged genomes show fewer long blocks 
and conservation across fewer genomes. Hence, a factor 
to consider for parameter selection is genome divergence 
in addition to genome lengths. Given the initial set of local 
alignments, automatic parameter selection seems possi- 
ble. It will be necessary to carefully study the influence of 
all factors to be able to automate the selection, but conse- 
quently it will enable a larger community to benefit from 
graph-based genome aligners. 

Finally, graph-based genome aligners, just as other 
genome aligners, have to decide between positional 
homology alignment [49] or alignment of all repeats. 
More precisely, they have to decide, for segments with 
multiple copies in several genomes, whether to align them 
in one or in multiple blocks. Not only do repeats lead to 
a quadratic explosion in the number of pairwise align- 
ments, but they also hide larger regions of colinearity. For 
this reason, several genome aligners aim at aligning less 
and predict positional homology [17,21,45]. Graph-based 
genome aligners compute positional homology to a cer- 
tain degree. They do not forbid duplications, but separate 
blocks into positional homologs when splitting blocks. 
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In conclusion, our framework demonstrates shared 
aspects of graph-based genome aligners. It contributes 
to developing a common view on graph-based genome 
alignment, an active field of research with currently at 
least two graph-based tools for genome alignment being 
actively developed [50,51]. In the future, we might iden- 
tify the steps that have the greatest influence on alignment 
accuracy. Already now, we believe that the framework pro- 
vides assistance for the development of new and improved 
genome aligners. 
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